DRILL-8474: Add Daffodil Format Plugin #2836
Conversation
@mbeckerle
I mistakenly pushed some code cleanup I did directly to your branch. I apologize for that. In any event, I added some comments to the BatchReader and FormatPlugin which I think will help you get unblocked.
dafParser.setInfosetOutputter(outputter);
// Lastly, we open the data stream
try {
  dataInputStream = dataInputURI.toURL().openStream();
Ok, I'm not sure why we need to do this. Drill can get you an input stream of the input file.
All you need to do is:
dataInputStream = negotiator.file().fileSystem().openPossiblyCompressedStream(negotiator.file().split().getPath());
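For reference, a minimal sketch of how that call might sit in the batch reader's open path. This is not the plugin's actual code; it assumes the EVF negotiator API shown in the suggestion above, and the `errorContext`, `dataInputStream`, and `logger` fields are assumed to exist in the surrounding reader class.

```java
// Sketch only. Assumes the EVF file-schema negotiator from the suggestion
// above; errorContext, dataInputStream and logger are fields of the
// surrounding batch reader.
try {
  dataInputStream = negotiator.file().fileSystem()
      .openPossiblyCompressedStream(negotiator.file().split().getPath());
} catch (IOException e) {
  throw UserException.dataReadError(e)
      .message("Failed to open input file %s", negotiator.file().split().getPath())
      .addContext(errorContext)
      .build(logger);
}
```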
For the data files this works.
For schemas, this will not be a solution even temporarily. Daffodil loads schemas from the classpath. Large schemas are complex objects, akin to a software system with dependencies expressed via XML Schema include/import statements with schemaLocation attributes that contain relative URLs or "absolute" URLs where absolute means relative to some root of some jar file on the classpath.
Even simple DFDL schemas are routinely spread over a couple of jars.
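To make the classpath point concrete, here is a hedged sketch (not the plugin's code) of compiling a schema resource found on the classpath with the Daffodil Java API, so include/import with relative schemaLocation URLs resolve against jar contents. The resource path is made up.

```java
// Sketch only: compile a DFDL schema that lives on the classpath.
import java.net.URI;
import org.apache.daffodil.japi.Compiler;
import org.apache.daffodil.japi.Daffodil;
import org.apache.daffodil.japi.DataProcessor;
import org.apache.daffodil.japi.ProcessorFactory;

public class ClasspathSchemaCompile {
  public static DataProcessor compile() throws Exception {
    // Hypothetical resource path inside a schema jar on the classpath.
    URI schemaUri = ClasspathSchemaCompile.class
        .getResource("/com/example/mySchema.dfdl.xsd").toURI();
    Compiler c = Daffodil.compiler();
    ProcessorFactory pf = c.compileSource(schemaUri);
    if (pf.isError()) {
      throw new IllegalStateException(pf.getDiagnostics().toString());
    }
    DataProcessor dp = pf.onPath("/");
    if (dp.isError()) {
      throw new IllegalStateException(dp.getDiagnostics().toString());
    }
    return dp;
  }
}
```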
@mbeckerle Looks like you're making good progress!
This is pretty much working now in terms of constructing Drill metadata from DFDL schemas. There were dozens of commits to get here, so I squashed them, as they were no longer helpful. Obviously more tests are needed, but the ones there show nested subrecords working. The issues of how schemas get distributed, and how Daffodil gets invoked in parallel by Drill, are still open.
Requires a 3.7.0-SNAPSHOT of Daffodil, which has the metadata support we're using.
New format-daffodil module created.
Still uses absolute paths for the schemaFileURI (which is cheating; it wouldn't work in a truly distributed Drill environment).
We have yet to work out how to enable Drill to provide access to DFDL schemas in XML form with include/import to be resolved.
The input data stream is, however, being accessed in the proper Drill manner. Gunzip happened automatically. Nice.
Note: Fixed the boxed Boolean vs. boolean problem. Don't use boxed primitives in format config objects.
Tests show this works for data as complex as nested repeating sub-records.
These DFDL types are supported:
- int
- long
- short
- byte
- boolean
- double
- float (does not work; bug DAFFODIL-2367)
- hexBinary
- string
apache#2835
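For illustration only, one plausible mapping from those DFDL simple types to Drill minor types. The plugin's actual mapping lives in its schema-conversion code and may differ.

```java
// Illustrative only: a plausible DFDL-simple-type to Drill-minor-type map.
import java.util.Map;
import org.apache.drill.common.types.TypeProtos.MinorType;

public class DfdlTypeMapping {
  public static final Map<String, MinorType> MAP = Map.of(
      "int", MinorType.INT,
      "long", MinorType.BIGINT,
      "short", MinorType.SMALLINT,
      "byte", MinorType.TINYINT,
      "boolean", MinorType.BIT,
      "double", MinorType.FLOAT8,
      "float", MinorType.FLOAT4,
      "hexBinary", MinorType.VARBINARY,
      "string", MinorType.VARCHAR);
}
```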
Rebased onto the latest Drill master as of 2023-12-21 (force-pushed one more time). Note that this is never going to pass automated tests until the Daffodil release it depends on is official. Currently it needs a locally built Daffodil 3.7.0-SNAPSHOT, though the main Daffodil branch has the changes integrated, so any 3.7.0-SNAPSHOT build will work.
Hi Mike,
This is looking good. I have some minor comments, mostly formatting. It seems like the next step would be to figure out where and how we store the DFDL files.
extends InfosetOutputter {

private boolean isOriginalRoot() {
  boolean result = currentTupleWriter() == rowSetWriter;
Is the Drill coding style defined in a wiki or other doc page somewhere? I didn't find one.
If this is just standard Java style, then I need reminding, as I have not coded in Java for 12+ years prior to this effort.
@cgivre yes, the next architectural-level issue is how to get a compiled DFDL schema out to every place Drill will run a Daffodil parse. Every one of those JVMs needs to reload it. I'll do the various cleanups and such. The one issue I don't know how to fix is the "typed setter" vs. set-object issue, so if you could steer me in the right direction on that, it would help.
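For context, a hedged sketch of what the typed-setter question is about: Drill's ScalarWriter exposes both a generic setObject and dedicated per-type setters. The switch below is hypothetical; it only illustrates the dispatch the writer code needs somewhere.

```java
// Sketch only: generic set-object vs. typed setters on Drill's ScalarWriter.
import org.apache.drill.common.types.TypeProtos.MinorType;
import org.apache.drill.exec.vector.accessor.ScalarWriter;

public class TypedSetterSketch {
  static void writeGeneric(ScalarWriter w, Object value) {
    w.setObject(value);  // boxes, and relies on runtime type dispatch
  }

  static void writeTyped(ScalarWriter w, MinorType type, Object value) {
    switch (type) {
      case INT:     w.setInt((Integer) value);      break;
      case BIGINT:  w.setLong((Long) value);        break;
      case FLOAT8:  w.setDouble((Double) value);    break;
      case BIT:     w.setBoolean((Boolean) value);  break;
      case VARCHAR: w.setString((String) value);    break;
      default:      w.setObject(value);
    }
  }
}
```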
Hi Mike,
Just jumping in with a random thought. Drill has accumulated a number of
schema systems: Parquet metadata cache, HMS, Drill's own metastore,
"provided schema", and now DFDL. All provide ways of defining data: be it
Parquet, JSON, CSV or whatever. One can't help but wonder, should some
future version try to reduce this variation somewhat? Maybe map all the
variations to DFDL? Map DFDL to Drill's own mechanisms?
Drill uses two kinds of metadata: schema definitions and file metadata used
for scan pruning. Schema information could be used at plan time (to provide
column types), but certainly at scan time (to "discover" the defined
schema.) File metadata is used primarily at plan time to work out how to
distribute work.
A bit of background on scan pruning. Back in the day, it was common to have
thousands or millions of files in Hadoop to scan: this was why tools like
Drill were distributed: divide and conquer. And, of course, the fastest
scan is to skip files that we know can't contain the information we want.
File metadata captures this information outside of the files themselves.
HMS was the standard solution in the Hadoop days. (Amazon Glue, for S3, is
evidently based on HMS.)
For example, Drill's Parquet metadata cache, the Drill metastore and HMS
all provide both schema and file metadata information. The schema
information mainly helped with schema evolution: over time, different files
have different sets of columns. File metadata provides information *about*
the file, such as the data ranges stored in each file. For Parquet, we
might track that '2023-01-Boston.parquet' has data from the office='Boston'
range. (So, no use scanning the file for office='Austin'.) And so on.
With Hadoop HFS, it was customary to use directory structure as a partial
primary index: our file above would live in the /sales/2023/01 directory,
for example, and logic chooses the proper set of directories to scan. In
Drill, it is up to the user to add crufty conditionals on the path name. In
Impala, and other HMS-aware tools, the user just says WHERE order_year =
2023 AND order_month = 1, and HMS tells the tool that the order_year and
order_month columns translate to such-and-so directory paths. Would be nice
if Drill could provide that feature as well, given the proper file
metadata: in this case, the mapping of column names to path directories and
file names.
Does DFDL provide only schema information? Does it support versioning so
that we know that "old.csv" lacks the "version" column, while "new.csv"
includes that column? Does it also include the kinds of file metadata
mentioned above?
Or, perhaps DFDL is used in a different context in which the files have a
fixed schema and are small in number? This would fit well the "desktop
analytics" model that Charles and James suggested is where Drill is now
most commonly used.
The answers might suggest whether DFDL can be the universal data description, or
if DFDL applies just to individual file schemas, and Drill would still need
a second system to track schema evolution and file metadata for large
deployments.
Further, if DFDL is kind of a stand-alone thing, with its own reader, then
we end up with more complexity: the Drill JSON reader and the DFDL JSON
reader. Same for CSV, etc. JSON is so complex that we'd find ourselves
telling people that the quirks work one way with the native reader, another
way with DFDL. Plus, the DFDL readers might not handle file splits the same
way, or support the same set of formats that Drill's other readers support,
and so on. It would be nice to separate the idea of schema description from
reader implementation, so that DFDL can be used as a source of schema for
any arbitrary reader: both at plan and scan times.
If DFDL uses its own readers, then we'd need DFDL reader representations in
Calcite, which would pick up DFDL schemas so that the schemas are reliably
serialized out to each node as part of the physical plan. This is possible,
but it does send us down the two-readers-for-every-format path.
On the other hand, if DFDL mapped to Drill's existing schema description,
then DFDL could be used with our existing readers and there would be just
one schema description sent to readers: Drill's existing provided schema
format that EVF can already consume. At present, just a few formats support
provided schema in the Calcite layer: CSV for sure, maybe JSON?
Any thoughts on where this kind of thing might evolve with DFDL in the
picture?
Thanks,
- Paul
Date, Time, DateTime, Boolean, Unsigned integers, Integer, NonNegativeInteger, Decimal, float, double, hexBinary.
Let me respond between the paragraphs....
On Tue, Jan 2, 2024 at 11:49 PM Paul Rogers wrote:
Hi Mike,
Just jumping in with a random thought. Drill has accumulated a number of
schema systems: Parquet metadata cache, HMS, Drill's own metastore,
"provided schema", and now DFDL. All provide ways of defining data: be it
Parquet, JSON, CSV or whatever. One can't help but wonder, should some
future version try to reduce this variation somewhat? Maybe map all the
variations to DFDL? Map DFDL to Drill's own mechanisms?
Well we can dream can't we :-)
I can contribute the ideas in
https://daffodil.apache.org/dev/design-notes/Proposed-DFDL-Standard-Profile.md
which is an effort to restrict the DFDL language so that schemas written in DFDL
can work more smoothly with Drill, NiFi, Spark, Flink, Beam, etc.
DFDL's data model is too restrictive to be "the model" for Drill since
Drill wants to query even unstructured data like XML without schema. DFDL's
data model is targeted only at structured data.
Drill's data model and APIs seem optimized for streaming block-buffered
top-level rows of data (the EVF API does anyway). Top level row-sets are
first-class citizens, as are the fields of said rows. Fields containing
arrays of maps (possibly containing more arrays of maps, and so on deeply
nested) are not handled uniformly with the same block-buffered "row-like"
mechanisms. The APIs are similar, but not polymorphic. I suspect that the
block-buffered data streaming in Drill only happens for top-level rows,
because there is no test for whether or not you are allowed to create
another array item like there is a test for creating another row in a
row-set writer. There is no control inversion where an adapter must give
back control to Drill in the middle of trying to write an array.
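A hedged sketch of the row-granularity control loop being described: EVF's RowSetLoader gives Drill back-pressure per top-level row via isFull()/start()/save(), and nothing equivalent interrupts writing a deeply nested array inside one row. The method signature and parseOneRow are hypothetical.

```java
// Sketch only: the per-row loop a batch reader's next() typically runs.
import org.apache.drill.exec.physical.resultSet.RowSetLoader;

public class RowLoopSketch {
  public boolean next(RowSetLoader rowWriter) {
    while (!rowWriter.isFull()) {   // Drill decides when the batch is full
      rowWriter.start();            // begin one top-level row
      boolean more = parseOneRow(rowWriter);  // fill columns, maps, arrays for this row
      rowWriter.save();             // commit the row
      if (!more) {
        return false;               // end of input
      }
    }
    return true;                    // batch full; Drill will call next() again
  }

  private boolean parseOneRow(RowSetLoader rowWriter) {
    return false;                   // placeholder for parsing one record
  }
}
```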
The current Drill/Daffodil interface I've created doesn't cope with
header-body* files (e.g., PCAP, which has a header record, then
repeating packet records) as it has no way of returning just the body
records as top level rows. So while there exists a DFDL schema for PCAP,
you really do want to use a dedicated PCAP Drill adapter which hands back
rows, not Daffodil which will parse the entire PCAP file into one huge row
containing a monster sub-array of packets, where each packet is a map
within the array of maps. This is ok for now as many files where DFDL is
used are not like PCAP. They are just repeating records of one format with
no special whole-file header. Eventually we will want to be able to supply
a path to tell the Drill/Daffodil interface that you only want the packet
array as the output rows. (This is the unimplemented Daffodil "onPath(...)"
API feature. We haven't needed this yet for DFDL work in cybersecurity, but
it was anticipated 10+ years back as essential for data integration.)
Drill uses two kinds of metadata: schema definitions and file metadata
used
for scan pruning. Schema information could be used at plan time (to
provide
column types), but certainly at scan time (to "discover" the defined
schema.) File metadata is used primarily at plan time to work out how to
distribute work.
DFDL has zero notion of file metadata. It doesn't know whether data even
comes from a file or an open TCP socket. Daffodil/DFDL just sees a
java.io.InputStream.
The schema it uses for a given file is specified by the API call. Daffodil
does nothing itself to try to find or identify any schema.
So we're "blank slate" on this issue with DFDL.
A bit of background on scan pruning. Back in the day, it was common to
have
thousands or millions of files in Hadoop to scan: this was why tools like
Drill were distributed: divide and conquer. And, of course, the fastest
scan is to skip files that we know can't contain the information we want.
File metadata captures this information outside of the files themselves.
HMS was the standard solution in the Hadoop days. (Amazon Glue, for S3, is
evidently based on HMS.)
For example, Drill's Parquet metadata cache, the Drill metastore and HMS
all provide both schema and file metadata information. The schema
information mainly helped with schema evolution: over time, different
files
have different sets of columns. File metadata provides information *about*
the file, such as the data ranges stored in each file. For Parquet, we
might track that '2023-01-Boston.parquet' has data from the
office='Boston'
range. (So, no use scanning the file for office='Austin'.) And so on.
With Hadoop HFS, it was customary to use directory structure as a partial
primary index: our file above would live in the /sales/2023/01 directory,
for example, and logic chooses the proper set of directories to scan. In
Drill, it is up to the user to add crufty conditionals on the path name.
In
Impala, and other HMS-aware tools, the user just says WHERE order_year =
2023 AND order_month = 1, and HMS tells the tool that the order_year and
order_month columns translate to such-and-so directory paths. Would be
nice
if Drill could provide that feature as well, given the proper file
metadata: in this case, the mapping of column names to path directories
and
file names.
The above all makes perfect sense to me, and DFDL schemas are completely
orthogonal to this.
If a file naming convention tells *Drill* that it doesn't need to open and
parse some data using Daffodil, great, then *Drill* will not invoke
Daffodil to do so.
DFDL/Daffodil doesn't know nor care about this.
Does DFDL provide only schema information? Does it support versioning so
that we know that "old.csv" lacks the "version" column, while "new.csv"
includes that column? Does it also include the kinds of file metadata
mentioned above?
DFDL only provides structural schema information.
Data formats do versioning in a wide variety of ways, so DFDL can't take
any position on how this is done, but many DFDL schemas adapt to multiple
versions of the data formats they describe based on the existence of
different fields or values of those fields. This can only work for formats
where there are data fields that identify the versions.
But nothing based on file metadata.
Or, perhaps DFDL is used in a different context in which the files have a
fixed schema and are small in number? This would fit well the "desktop
analytics" model that Charles and James suggested is where Drill is now
most commonly used.
The cybersecurity use case is one of the prime motivators for DFDL work.
Often the cyber gateways are file movers, files arrive spontaneously in
various locations, and are moved across the cyber boundary.
The use cases continue to grow in scale, and some people use Apache NiFi
with DFDL for large scale such file moving.
Unlike Drill, these use cases all parse and then re-serialize the data
after extensive validation and rule-based filtering.
The same sort of file-metadata based stuff - ex: rules like all the files
in this directory named X with extension ".dat" use schema S - all applies
in the cyber-gateway use case.
Apache Daffodil doesn't know anything about this cyber use case however,
nor anything about data integration. Daffodil is actually a quite narrow
library. Stays in its lane.
The answers might suggest if DFDL can be the universal data description, or if
DFDL applies just to individual file schemas, and Drill would still need a
second system to track schema evolution and file metadata for large deployments.
Yeah. Drill needs a separate system for this. Not at all a DFDL-specific
issue. DFDL/Daffodil take no position on schema evolution.
However, to Daffodil devs, a DFDL schema is basically source code. We keep
them in git. They have releases. We package them in jars and use managed
dependency tools to grab them from repositories the same way java code jars
are grabbed by maven.
One of my concerns about metadata repositories/registries is that they are
not thought of as configuration management systems. But DFDL schemas are
certainly large formal objects that require configuration management.
For example, the VMF schema we have is over 180K lines of DFDL "code",
spread over hundreds of files. It is actually an assembly composed of
specific versions of 4 different smaller DFDL schemas and the large corpus
of VMF-specific schema files. There is documentation, analysis reports,
etc. that go along with it.
So some sort of repository that makes specific schemas available to Drill
makes sense, but cannot be confused with the configuration management
system.
I quite literally just got a Maven Central/Sonatype account yesterday so that I can push some DFDL schemas up to Maven Central so they can be reused from there via jars.
Further, if DFDL is kind of a stand-alone thing, with its own reader, then
we end up with more complexity: the Drill JSON reader and the DFDL JSON
reader. Same for CSV, etc. JSON is so complex that we'd find ourselves
telling people that the quirks work one way with the native reader,
another
way with DFDL. Plus, the DFDL readers might not handle file splits the
same
way,
Daffodil knows no concept of "file splits". It doesn't even know about
files actually. It's just an input byte stream. literally a
java.io.InputStream.
or support the same set of formats that Drill's other readers support,
and so on. It would be nice to separate the idea of schema description
from
reader implementation, so that DFDL can be used as a source of schema for
any arbitrary reader: both at plan and scan times.
The DFDL/Drill integration converts DFDL-described data directly to Drill
with no intermediate form like XML nor JSON. One hop. E.g.,
drillScalarWriter.setInt(daffodilInfosetElement.getInt());
There is no notion of Daffodil "also" reading JSON. You wouldn't parse JSON
with DFDL typically. You would use a JSON library and hopefully a JSON
schema that describes the JSON.
Ditto for XML, Google protocol buffers, Avro, etc.
If DFDL uses its own readers, then we'd need DFDL reader representations in
Calcite, which would pick up DFDL schemas so that the schemas are reliably
serialized out to each node as part of the physical plan. This is possible,
but it does send us down the two-readers-for-every-format path.
DFDL is a specific reader, so this notion of "its own readers" doesn't apply.
On the other hand, if DFDL mapped to Drill's existing schema description,
then DFDL could be used with our existing readers
I don't get "DFDL used with existing readers".... by "with" you mean
"along-side" or "incorporating"?
and there would be just
one schema description sent to readers: Drill's existing provided schema
format that EVF can already consume. At present, just a few formats
support
provided schema in the Calcite layer: CSV for sure, maybe JSON?
This is what we need. The Daffodil/Drill integration walks DFDL metadata
and creates Drill metadata 100% in advance and this should, I think,
automatically find its way to all the right places without anything else
being needed beyond today's Drill behavior.
But besides Drill's metadata, the Daffodil execution at each node needs to
load up the compiled DFDL schema. That object, which can be several
megabytes, needs to find its way out to all the nodes that need it.
This I have no idea how we make happen.
I imported the dev-support/formatter/eclipse settings and used them to reformat the code in IntelliJ IDEA. No functional changes in this commit.
Also a few code cleanups.
This is ready for a next review. All the scalar types are now implemented with typed setter calls. The prior review comments have all been addressed I believe. Remaining things to do include:
@mbeckerle I had a thought about your TODO list. See inline.
I was thinking about this and I remembered something that might be useful. Drill has support for User Defined Functions (UDF) which are written in Java. To add a UDF to Drill, you also have to write some Java classes in a particular way, and include the JARs. Much like the DFDL class files, the UDF JARs must be accessible to all nodes of a Drill cluster. Additionally, Drill has the capability of adding UDFs dynamically. This feature was added here: #574. Anyway, I wonder if we could use a similar mechanism to load and store the DFDL files so that they are accessible to all Drill nodes. What do you think?
Excellent: so Drill has all the machinery; it's just a question of repackaging it so it's available for this usage pattern, which is a bit different from Drill's UDFs but also very similar. There are two user scenarios, which we can call production and test.
Kinds of objects involved are:
- Code jars: Daffodil provides two extension features for DFDL users - DFDL UDFs and DFDL 'layers' (ex: plug-ins for uudecode, or gunzip algorithms used in part of the data format). Those are ordinary compiled class files in jars, so in all scenarios those jars are needed on the node classpath if the DFDL schema uses them. Daffodil dynamically finds and loads these from the classpath via the regular Java Service-Provider Interface (SPI) mechanisms.
- Schema jars: Daffodil packages DFDL schema files (source files, i.e., mySchema.dfdl.xsd) into jar files to allow inter-schema dependencies to be managed using ordinary jar/java-style managed dependencies. Tools like sbt and maven can express the dependencies of one schema on another, grab and pull them together, etc. Daffodil has a resolver, so when one schema file references another with include/import, it searches the classpath directories and jars for the files. Schema jars are only needed centrally when compiling the schema to a binary file. All references to the jar files for inter-schema file references are compiled into the compiled binary file.
  It is possible for one DFDL schema 'project' to define a DFDL schema along with the code for a plugin like a Daffodil UDF or layer. In that case the one jar created is both a code jar and a schema jar. The schema-jar aspects are used when the schema is compiled and ignored at Daffodil runtime; the code-jar aspects are used at Daffodil runtime and ignored at schema compile time. So such a jar needs to be on the classpath in both places, but there is no interaction between the two roles.
- Binary compiled schema file: Centrally, DFDL schemas in files and/or jars are compiled to create a single binary object which can be reloaded in order to actually use the schema to parse/unparse data (see the sketch after this list).
- Daffodil config file: This contains settings like what warnings to suppress when compiling and/or at runtime, and tunables such as how large to allow a regex match attempt, the maximum parsed data size limit, etc. This is needed both at schema compile time and at runtime, as the same file contains parameters for both.
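A hedged sketch (with made-up paths) of the central compile-and-save step using the Daffodil Java API, producing the binary file that nodes later reload; diagnostics checking is omitted for brevity.

```java
// Sketch only: compile a DFDL schema and save the compiled binary form.
import java.io.FileOutputStream;
import java.net.URI;
import java.nio.channels.Channels;
import org.apache.daffodil.japi.Compiler;
import org.apache.daffodil.japi.Daffodil;
import org.apache.daffodil.japi.DataProcessor;
import org.apache.daffodil.japi.ProcessorFactory;

public class CompileAndSave {
  public static void main(String[] args) throws Exception {
    Compiler c = Daffodil.compiler();
    ProcessorFactory pf = c.compileSource(new URI("file:///schemas/mySchema.dfdl.xsd"));
    DataProcessor dp = pf.onPath("/");
    // The saved binary can be shipped to nodes and reloaded with Compiler.reload(...).
    try (FileOutputStream fos = new FileOutputStream("/tmp/mySchema.bin")) {
      dp.save(Channels.newChannel(fos));
    }
  }
}
```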
private void loadSchema(URI schemaFileURI) throws IOException, InvalidParserException {
  Compiler c = Daffodil.compiler();
  dp = c.reload(Channels.newChannel(schemaFileURI.toURL().openStream()));
@cgivre This reload call is the one that has to happen on every drill node.
It needs only to happen once for that schema for the life of the JVM. The "dp" object created here can be reused every time that schema is needed to parse more data. The dp (DataProcessor) is a read only (thread safe) data structure.
As you see, this can throw exceptions, so the question of how those situations should be handled arises.
Even if Drill perfectly makes the file available to every node (which would rule out the IOException due to file-not-found or access rights), a user can still create a compiled DFDL schema binary file using the wrong version of the Daffodil schema compiler, which is a mismatch for the runtime; hence the InvalidParserException can still be thrown.
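A hedged sketch of the reload-once-per-JVM idea: cache the reloaded DataProcessor (read-only and thread safe) keyed by schema URI. The cache itself is hypothetical; only the reload call mirrors the code above.

```java
// Sketch only: reload a compiled DFDL schema once per JVM and reuse it.
import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.nio.channels.Channels;
import org.apache.daffodil.japi.Daffodil;
import org.apache.daffodil.japi.DataProcessor;

public class DataProcessorCache {
  private static final Map<URI, DataProcessor> CACHE = new ConcurrentHashMap<>();

  public static DataProcessor get(URI compiledSchemaUri) {
    return CACHE.computeIfAbsent(compiledSchemaUri, uri -> {
      try {
        return Daffodil.compiler()
            .reload(Channels.newChannel(uri.toURL().openStream()));
      } catch (Exception e) {
        throw new RuntimeException("Failed to reload compiled DFDL schema: " + uri, e);
      }
    });
  }
}
```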
This definitely seems like an area where there is potential for a lot of different things to go wrong. My view is we should just do our best to provide clear error messages so that the user can identify and fix the issues.
try {
  dmp.loadSchema(schemaFileURI);
} catch (IOException | InvalidParserException e) {
  throw new CompileFailure(e);
Error architecture?
This loadSchema call needs to happen on every node, and so has the potential (if the loaded binary schema file is no good or mismatches the Daffodil library version) to fail. Is throwing this exception the right thing here or are other steps preferred/necessary?
My thought here would be to fail as quickly as possible. If the DFDL schema can't be read, I'm assuming that we cannot proceed, so throwing an exception would be the right thing to do IMHO. With that said, we should make sure we provide a good error message that would explain what went wrong.
One of the issues we worked on for a while with Drill was that it would fail and you'd get a stack trace w/o a clear idea of what the actual issue is and how to rectify it.
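For illustration, a hedged sketch of what fail-fast with a clear message might look like in that catch block, using Drill's UserException builder; the message text and errorContext are illustrative, not the plugin's actual wording.

```java
// Sketch only: surface a schema-load failure as an actionable UserException.
try {
  dmp.loadSchema(schemaFileURI);
} catch (IOException | InvalidParserException e) {
  throw UserException.dataReadError(e)
      .message("Failed to load compiled DFDL schema: %s", schemaFileURI)
      .addContext("Check that the file exists and was compiled with a "
          + "Daffodil version compatible with the runtime")
      .addContext(errorContext)
      .build(logger);
}
```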
    .addContext(errorContext).build(logger);
}
if (dafParser.isValidationError()) {
  logger.warn(dafParser.getDiagnosticsAsString());
Do we need an option here to convert validation errors to fatal?
Will logger.warn be seen by a query user, or is that just for someone dealing with the logs?
Validation errors either should be escalated to fatal, OR they should be visible in the query output display to a user somehow.
Either way, users will need a mechanism to suppress validation errors that prove to be unavoidable, since they could be commonplace. Nobody wants thousands of warnings about something they can't avoid that doesn't stop parsing and querying the data.
@mbeckerle The question I'd have is whether the query can proceed if validation fails. (I don't know the answer)
If the answer is no, then we need to halt execution ASAP and throw an exception. If the answer is it can proceed, but the data might be less than ideal, maybe we add a configuration option which will allow the user to decide the behavior on a validation failure.
I could imagine situations where you have Drill unable to read a huge file because someone fat fingered a quotation mark somewhere or something like that. In a situation like that, sometimes you might just want to say I'll accept a row or two of bad data just so I can read the whole file.
Agree.
We draw a distinction between "well formed" and "invalid" data and whether one does validation seems like the right switch in daffodil to use.
If data is malformed, that means you can't successfully parse it. If it is invalid, that just means values are unexpected. Example: A 3 digit number representing a percentage 0 to 100. -1 is invalid, ABC is malformed.
If data is not well formed, you really cannot continue parsing it, as you cannot convert it to the type expected. But, if you are able to determine at least how big it is, it's possible to capture that length of data into a dummy "badData" element which is always invalid (so isn't a "false positive" parse). This capability has to be designed into the DFDL schema, but it is something we've been doing more and more.
Hence, one can tolerate even some malformed data. If it is malformed to where you cannot determine the length, then continuing is impossible.
We will see if more than this is needed. Options like "treat everything as string/varchar" or "treat all numbers as float", which you have for tolerating such situations with other data connectors, may prove useful, particularly while a DFDL schema is in development and you are really just testing it (and the corresponding data) using Drill.
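For illustration, a hedged sketch of how a format-config switch could choose between escalating Daffodil validation diagnostics and merely logging them; failOnValidationError is a hypothetical option name, and the surrounding fields come from the reader excerpt above.

```java
// Sketch only: optional escalation of validation diagnostics to a fatal error.
if (dafParser.isValidationError()) {
  String diags = dafParser.getDiagnosticsAsString();
  if (config.failOnValidationError()) {   // hypothetical config option
    throw UserException.dataReadError()
        .message("DFDL validation failed: %s", diags)
        .addContext(errorContext)
        .build(logger);
  } else {
    logger.warn(diags);                   // current behavior: warn and continue
  }
}
```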
@mbeckerle Would you want to chat sometime next week and I can walk you through the UDF architecture? I don't know how relevant it would be, but you'd at least see how things are installed and so forth.
@cgivre I believe the style issues are all fixed. The build did not flag any code-style issues.
The issue I was referring to was more around the organization of a few classes. Usually we'll have the constructor (if present) at the top followed by any class methods. I think there was a class or two where the constructor was at the bottom or something like that. In any event, consider the issue resolved.
This significantly simplifies the metadata walking that converts Daffodil metadata to Drill metadata.
@cgivre @paul-rogers is there an example of a Drill UDF that is not part of the Drill repository tree? I'd like to understand the mechanisms for distributing any jar files and dependencies of the UDF that Drill uses. I can't find any such among the quasi-UDFs that are in the Drill tree because, since they are part of Drill and so are their dependencies, this problem doesn't exist.
@mbeckerle Here's an example: https://github.com/datadistillr/drill-humanname-functions. I'm sorry we weren't able to connect last week.
If I understand this correctly, if a jar is on the classpath and has drill-module.conf in its root dir, then Drill will find it and read that HOCON file to get the package to add to drill.classpath.scanning.packages. Drill then appears to scan jars for class files in those packages. It's not clear to me what it does with the class files. I imagine it is repackaging them somehow so Drill can use them on the distributed nodes, but it isn't yet clear to me how this aspect works. Do these classes just get loaded on the distributed Drill nodes? Or is the classpath augmented in some way on the Drill nodes so that they see a jar that contains all these classes? I have two questions: (1) what about dependencies? The UDF may depend on libraries which depend on other libraries, etc. (2) what about non-class files, e.g., things under src/main/resources of the project that go into the jar but aren't class files? How do those things also get moved? How would code running on a Drill node access these? The usual method is to call getResource(URL) with a URL that gives the path within a jar file to the resource in question. Thanks for any info.
I believe that is correct.
So UDFs are a bit of a special case, but if they do have dependencies, you have to also include those JAR files in the UDF directory, or in Drill's 3rd party JAR folder. I'm not that good with maven, but I've often wondered about making a so-called fat-JAR which includes the dependencies as part of the UDF JAR file.
Take a look at this UDF. https://github.com/datadistillr/drill-geoip-functions
Ok, so the geo-ip UDF has no special mechanisms or description for those resource files, so the generic code that "scans" must find them and drag them along automatically. That's the behavior I want. @cgivre What is "Drill's 3rd party JAR folder"? If a magic folder just gets dragged over to all nodes, and Drill uses a class loader that arranges for jars in that folder to be searched, then there is very little to do, since a DFDL schema can be just a set of jar files containing related resources, plus the classes for Daffodil's own UDFs and layers, which are Java code extensions of its own kind.
Uses JPrimType now, not strings.
This now passes all the daffodil contrib tests using the published official Daffodil 3.7.0. It does not yet run in any scalable fashion, but the metadata/data interfacing is complete. I would like to squash this to a single commit before merging, and it needs to be tested rebased onto the latest Drill commit.
Creating a new squashed PR so as to avoid loss of the comments on this PR.
Adding Daffodil to Drill as a 'contrib'
Requires Daffodil 3.7.0-SNAPSHOT which has metadata support we're using.
New format-daffodil module created
Still uses absolute paths for the schemaFileURI (which is cheating; it wouldn't work in a truly distributed Drill environment).
We have yet to work out how to enable Drill to provide access for DFDL schemas in XML form with include/import to be resolved.
The input data stream is, however, being accessed in the proper Drill manner. Gunzip happened automatically. Nice.
Note: Fix boxed Boolean vs. boolean problem. Don't use boxed primitives in Format config objects.
Tests show Daffodil works for data as complex as having nested repeating sub-records.
These DFDL types are supported:
- int
- long
- short
- byte
- boolean
- double
- float (does not work; bug DAFFODIL-2367)
- hexBinary
- string
#2835